CLAIMS

1. Apparatus for processing image data and sound data,
comprising:

an image processor for processing image data
recorded by at least one camera showing the movements of
a plurality of people to track each person in three
dimensions;

a sound processor for processing sound data to
determine the direction of arrival of the sound;

a speaker identifier for determining which of the
people is speaking based on the result of the processing
performed by the image processor and the result of the
processing performed by the sound processor; and

a voice recognition processor for processing the
received sound data to generate text data therefrom in
dependence upon the result of the processing performed by
the speaker identifier.
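
As an illustration only, the speaker identifier of claim 1 could combine the tracked three-dimensional head positions with the direction of arrival of the sound along the following lines. This is a minimal sketch, not the claimed implementation: the function name `identify_speaker`, the single microphone position, the use of a horizontal azimuth angle in degrees, and the `tolerance_deg` threshold are all assumptions introduced for the example.

```python
import math


def identify_speaker(head_positions, mic_position, doa_azimuth_deg, tolerance_deg=10.0):
    """Pick the person whose bearing from the microphone matches the sound direction.

    head_positions: dict mapping a person id to an (x, y, z) head position in a
        shared world frame (hypothetical output of the image processor).
    mic_position: (x, y) position of the microphone array in the same frame.
    doa_azimuth_deg: direction of arrival of the sound, in degrees (assumed form
        of the sound processor output).
    Returns the person id, or None when no person or more than one person lies
    within the tolerance, i.e. the speaker cannot be identified from this frame.
    """
    candidates = []
    for person_id, (x, y, _z) in head_positions.items():
        bearing = math.degrees(math.atan2(y - mic_position[1], x - mic_position[0]))
        # Wrap the angular difference into [-180, 180) before taking its magnitude.
        error = abs((bearing - doa_azimuth_deg + 180.0) % 360.0 - 180.0)
        if error <= tolerance_deg:
            candidates.append((error, person_id))
    if len(candidates) != 1:
        return None
    return candidates[0][1]


heads = {"A": (1.0, 0.0, 1.2), "B": (0.0, 1.0, 1.2), "C": (-1.0, 0.0, 1.2)}
speaker = identify_speaker(heads, (0.0, 0.0), 88.0)
```

Returning `None` for an ambiguous frame, rather than guessing, leaves room for the identification to be resolved from other frames.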

2. Apparatus according to claim 1, wherein the voice
recognition processor includes a store for storing
respective voice recognition parameters for each of the
people, and a selection processor for selecting the voice
recognition parameters to be used to process the sound
data in dependence upon the person determined to be
speaking by the speaker identifier.

3. Apparatus according to claim 1, wherein the image
processor is arranged to track each person by processing 
the image data using camera calibration data defining the 
position and orientation of each camera from which image 
data is processed. 



4. Apparatus according to claim 1, wherein the image 
processor is arranged to track each person by tracking 
each person's head.



5. Apparatus according to claim 1, wherein the image 
processor is arranged to process the image data to
determine where at least each person who is speaking is
looking.



6. Apparatus according to claim 1, wherein the speaker
identifier is arranged to identify a person who is
speaking in a given frame of the received image data
using the results of the processing performed by the
image processor and the sound processor for at least one
other frame if the speaker cannot be identified using the
results of the processing performed by the image
processor and the sound processor for the given frame.
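
One possible reading of the fallback in claim 6 is sketched below, under the assumption that each frame has already been given either a speaker id or `None` (speaker not identifiable from that frame alone). The function name `fill_unidentified_frames` and the choice of borrowing the result from the nearest identified frame are illustrative assumptions; the claim only requires that results for at least one other frame be used.

```python
def fill_unidentified_frames(per_frame_speaker):
    """Resolve frames with no identified speaker from the nearest identified frame.

    per_frame_speaker: list with one entry per frame, holding a person id or
        None where the speaker could not be identified from that frame alone.
    Returns a new list; frames stay None only if no frame at all was identified.
    """
    resolved = list(per_frame_speaker)
    known = [i for i, speaker in enumerate(per_frame_speaker) if speaker is not None]
    for i, speaker in enumerate(per_frame_speaker):
        if speaker is None and known:
            # On a tie in distance, min() keeps the earlier frame.
            nearest = min(known, key=lambda k: abs(k - i))
            resolved[i] = per_frame_speaker[nearest]
    return resolved


resolved = fill_unidentified_frames(["A", None, None, "B", None])
```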

7. Apparatus according to claim 1, further comprising
a database for storing at least some of the received
image data, the sound data, the text data produced by the
voice recognition processor and viewing data defining
where at least each person who is speaking is looking,
the database being arranged to store the data such that
corresponding text data and viewing data are associated
with each other and with the corresponding image data and
sound data.



8. Apparatus according to claim 7, further comprising
a data compressor for compressing the image data and the
sound data for storage in the database.



9. Apparatus according to claim 8, wherein the data
compressor comprises a data encoder for encoding the
image data and the sound data as MPEG data.

10. Apparatus according to claim 7, further comprising
a gaze data generator for generating data defining, for
a predetermined period, the proportion of time spent by
a given person looking at each of the other people during
the predetermined period, and wherein the database is
arranged to store the data so that it is associated with
the corresponding image data, sound data, text data and
viewing data.

11. Apparatus according to claim 10, wherein the
predetermined period comprises a period during which the
given person was talking. 
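
The gaze data of claims 10 and 11 can be illustrated with a short calculation. The sketch below assumes, purely for the example, that the viewing data is available as one record per frame holding the current speaker and a mapping from each person to the person they are looking at; the function name `gaze_proportions` and this record layout are hypothetical.

```python
from collections import Counter


def gaze_proportions(frames, person, others, only_while_talking=False):
    """Proportion of frames in which `person` looks at each of `others`.

    frames: list of (speaker_id, gaze) pairs, where gaze maps each person id to
        the id of the person they are looking at, or None (assumed layout).
    only_while_talking: if True, restrict the period to the frames in which
        `person` is the speaker, as in claim 11.
    """
    selected = [gaze for speaker, gaze in frames
                if not only_while_talking or speaker == person]
    if not selected:
        return {other: 0.0 for other in others}
    counts = Counter(gaze.get(person) for gaze in selected)
    return {other: counts[other] / len(selected) for other in others}


frames = [
    ("A", {"A": "B"}),
    ("A", {"A": "C"}),
    ("B", {"A": "B"}),
    ("B", {"A": "B"}),
]
overall = gaze_proportions(frames, "A", ["B", "C"])
while_talking = gaze_proportions(frames, "A", ["B", "C"], only_while_talking=True)
```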

12. Apparatus for processing image data and sound data,
comprising:

an image processor for processing image data
recorded by at least one camera showing the movements of
a plurality of people to track each person in three
dimensions;

a sound processor for processing sound data to
determine the direction of arrival of the sound; and

a speaker identifier for determining which of the
people is speaking based on the result of the processing
performed by the image processor and the result of the
processing performed by the sound processor.

13. Apparatus according to claim 12, wherein the image
processor is arranged to track each person by processing
the image data using camera calibration data defining the
position and orientation of each camera from which image
data is processed.

14. Apparatus according to claim 12, wherein the image
processor is arranged to track each person by tracking 
each person's head. 

15. Apparatus according to claim 12, wherein the image
processor is arranged to process the image data to
determine where at least each person who is speaking is
looking. 

16. Apparatus according to claim 12, wherein the speaker
identifier is arranged to identify a person who is
speaking in a given frame of the received image data
using the results of the processing performed by the
image processor and the sound processor for at least one
other frame if the speaker cannot be identified using the
results of the processing performed by the image
processor and the sound processor for the given frame.

17. A method of processing image data and sound data,
comprising:

an image processing step comprising processing
image data recorded by at least one camera showing the
movements of a plurality of people to track each person
in three dimensions;

a sound processing step comprising processing sound
data to determine the direction of arrival of the sound;

a speaker identification step comprising determining
which of the people is speaking based on the result of
the processing performed in the image processing step and
the result of the processing performed in the sound
processing step; and

a voice recognition processing step comprising
processing the received sound data to generate text data
therefrom in dependence upon the result of the processing
performed in the speaker identification step.

18. A method according to claim 17, wherein the voice
recognition processing step includes selecting, from
stored respective voice recognition parameters for each
of the people, the voice recognition parameters to be
used to process the sound data in dependence upon the
person determined to be speaking in the speaker
identification step.

19. A method according to claim 17, wherein, in the
image processing step, each person is tracked by
processing the image data using camera calibration data
defining the position and orientation of each camera from 
which image data is processed. 

20. A method according to claim 17, wherein, in the
image processing step, each person is tracked by tracking
the person's head.



21. A method according to claim 17, wherein, in the
image processing step, the image data is processed to
determine where at least each person who is speaking is
looking. 

22. A method according to claim 17, wherein, in the
speaker identification step, a person who is speaking in
a given frame of the received image data is identified
using the results of the processing performed in the
image processing step and the sound processing step for
at least one other frame if the speaker cannot be
identified using the results of the processing performed
in the image processing step and the sound processing
step for the given frame.






23. A method according to claim 17, further comprising 
the step of generating a signal conveying the text data
generated in the voice recognition processing step. 






24. A method according to claim 17, further comprising
the step of storing in a database at least some of the
received image data, the sound data, the text data
produced in the voice recognition processing step and
viewing data defining where at least each person who is
speaking is looking, the data being stored in the
database such that corresponding text data and viewing
data are associated with each other and with the
corresponding image data and sound data.



25. A method according to claim 24, wherein the image
data and the sound data are stored in the database in
compressed form.

26. A method according to claim 25, wherein the image
data and the sound data are stored as MPEG data.


27. A method according to claim 24, further comprising
the steps of generating data defining, for a
predetermined period, the proportion of time spent by a
given person looking at each of the other people during
the predetermined period, and storing the data in the
database so that it is associated with the corresponding
image data, sound data, text data and viewing data.

28. A method according to claim 27, wherein the
predetermined period comprises a period during which the
given person was talking.

29. A method according to claim 24, further comprising
the step of generating a signal conveying the database
with data therein.

30. A method according to claim 29, further comprising 
the step of recording the signal either directly or 
indirectly to generate a recording thereof. 

31. A method of processing image data and sound data,
comprising:

an image processing step comprising processing image
data recorded by at least one camera showing the
movements of a plurality of people to track each person
in three dimensions;

a sound processing step comprising processing sound
data to determine the direction of arrival of the sound;
and

a speaker identification step comprising determining
which of the people is speaking based on the result of
the processing performed in the image processing step and
the result of the processing performed in the sound
processing step.

32. A method according to claim 31, wherein, in the
image processing step, each person is tracked by 
processing the image data using camera calibration data 
defining the position and orientation of each camera from 
which image data is processed. 

33. A method according to claim 31, wherein, in the 
image processing step, each person is tracked by tracking 
the person's head.




34. A method according to claim 31, wherein, in the 
image processing step, the image data is processed to 
determine where at least each person who is speaking is
looking. 



35. A method according to claim 31, wherein, in the
speaker identification step, a person who is speaking in
a given frame of the received image data is identified
using the results of the processing performed in the
image processing step and the sound processing step for
at least one other frame if the speaker cannot be
identified using the results of the processing performed
in the image processing step and the sound processing
step for the given frame.

36. A method according to claim 31, further comprising
the step of generating a signal conveying the identity of
the speaker identified in the speaker identification
step.



37. A storage device storing instructions for causing a 
programmable processing apparatus to become configured as 
an apparatus as set out in at least one of claims 1 and 
12. 



38. A storage device storing instructions for causing a 
programmable processing apparatus to become operable to
perform a method as set out in at least one of claims 17
and 31.



39. A signal conveying instructions for causing a
programmable processing apparatus to become configured as
an apparatus as set out in at least one of claims 1 and
12. 



40. A signal conveying instructions for causing a 
programmable processing apparatus to become operable to
perform a method as set out in at least one of claims 17
and 31. 

41. Apparatus for processing image data and sound data,
comprising:

image processing means for processing image data
recorded by at least one camera showing the movements of
a plurality of people to track each person in three
dimensions;

sound processing means for processing sound data to
determine the direction of arrival of the sound;

speaker identification means for determining which
of the people is speaking based on the result of the
processing performed by the image processing means and
the result of the processing performed by the sound
processing means; and

voice recognition processing means for processing
the received sound data to generate text data therefrom
in dependence upon the result of the processing performed
by the speaker identification means.



42. Apparatus for processing image data and sound data,
comprising:

image processing means for processing image data
recorded by at least one camera showing the movements of
a plurality of people to track each person in three
dimensions;

sound processing means for processing sound data to
determine the direction of arrival of the sound; and

speaker identification means for determining which
of the people is speaking based on the result of the
processing performed by the image processing means and
the result of the processing performed by the sound
processing means.

43. Apparatus for processing image data and sound data,
comprising:

an image processor for processing image data
recorded by at least one camera showing the movements of
a plurality of people to determine where each person is
looking and to determine which of the people is speaking
based on where the people are looking; and

a sound processor for processing sound data
defining words spoken by the people to generate text data
therefrom in dependence upon the result of the processing
performed by the image processor.

44. Apparatus according to claim 43, wherein the sound 
processor includes a store for storing respective voice 
recognition parameters for each of the people, and a
selection processor for selecting the voice recognition
parameters to be used to process the sound data in
dependence upon the person determined to be speaking by
the image processor.

45. Apparatus according to claim 43, wherein the image
processor is arranged to determine where each person is
looking by processing the image data using camera
calibration data defining the position and orientation of
each camera from which image data is processed.

46. Apparatus according to claim 43, wherein the image
processor is arranged to determine where each person is
looking by processing the image data to track the
position and orientation of each person's head in three
dimensions.

47. Apparatus according to claim 43, wherein the image
processor is arranged to determine which person is
speaking based on the number of people looking at each
person.






48. Apparatus according to claim 47, wherein the image
processor is arranged to generate a value for each person
defining at whom the person is looking and to process the
values to determine the person who is speaking.






49. Apparatus according to claim 43, wherein the image
processor is arranged to determine that the person who is
speaking is the person at whom the most other people are
looking.
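
The gaze-based determination of claims 47 to 49 can be sketched as a simple vote. In the example below, each person is given a value naming the person they are looking at (claim 48), and the speaker is taken to be the person at whom the most other people are looking (claim 49). The function name `speaker_from_gaze` and the decision to return `None` on a tie are assumptions made for illustration; the claims do not say how ties are handled.

```python
from collections import Counter


def speaker_from_gaze(gaze_targets):
    """Return the person at whom the most other people are looking.

    gaze_targets: dict mapping each viewer id to the id of the person that
        viewer is looking at, or None if they are looking at nobody.
    Returns None if nobody is being looked at or if the top count is tied
    (tie handling is an assumption of this sketch).
    """
    votes = Counter(
        target
        for viewer, target in gaze_targets.items()
        if target is not None and target != viewer
    )
    if not votes:
        return None
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


speaker = speaker_from_gaze({"A": "B", "B": "C", "C": "B", "D": "B"})
```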

50. Apparatus according to claim 43, further comprising 
a database for storing the image data, the sound data, 
the text data produced by the sound processor and viewing 
data defining where each person is looking, the database 
being arranged to store the data such that corresponding 
text data and viewing data are associated with each other 
and with the corresponding image data and sound data. 



51. Apparatus according to claim 50, further comprising
a data compressor for compressing the image data and the
sound data for storage in the database.



52. Apparatus according to claim 51, wherein the data
compressor comprises a data encoder for encoding the 
image data and the sound data as MPEG data. 



53. Apparatus according to claim 50, further comprising 
a gaze data generator for generating data defining, for 
a predetermined period, the proportion of time spent by 
a given person looking at each of the other people during
the predetermined period, and wherein the database is
arranged to store the data so that it is associated with
the corresponding image data, sound data, text data and
viewing data.

54. Apparatus according to claim 53, wherein the
predetermined period comprises a period during which the
given person was talking.

55. Apparatus for processing image data, comprising:

a receiver for receiving image data recorded by at
least one camera showing the movements of a plurality of
people; and

an image processor for processing the image data to
determine where each person is looking and to determine
which of the people is speaking based on where the people
are looking.

56. Apparatus according to claim 55, wherein the image
processor is arranged to determine where each person is
looking by processing the image data using camera
calibration data defining the position and orientation of
each camera from which image data is processed.

57. Apparatus according to claim 55, wherein the image
processor is arranged to determine where each person is
looking by processing the image data to track the
position and orientation of each person's head in three
dimensions.

58. Apparatus according to claim 55, wherein the image 
processor is arranged to determine which person is
speaking based on the number of people looking at each
person. 

59. Apparatus according to claim 58, wherein the image
processor is arranged to generate a value for each person
defining at whom the person is looking and to process the
values to determine the person who is speaking.

60. Apparatus according to claim 55, wherein the image
processor is arranged to determine that the person who is
speaking is the person at whom the most other people are
looking.



62. A method according to claim 61, wherein the sound
processing step includes selecting, from stored
respective voice recognition parameters for each of the
people, the voice recognition parameters to be used to
process the sound data in dependence upon the person
determined to be speaking in the image processing step.




63. A method according to claim 61, wherein, in the
image processing step, it is determined where each person
is looking by processing the image data using camera
calibration data defining the position and orientation of
each camera from which image data is processed.


64. A method according to claim 61, wherein, in the
image processing step, it is determined where each person
is looking by processing the image data to track the
position and orientation of each person's head in three
dimensions.

65. A method according to claim 61, wherein, in the
image processing step, it is determined which person is
speaking based on the number of people looking at each
person.

66. A method according to claim 65, wherein, in the
image processing step, a value is generated for each
person defining at whom the person is looking and the
values are processed to determine the person who is
speaking.

67. A method according to claim 61, wherein, in the
image processing step, it is determined that the person
who is speaking is the person at whom the most other
people are looking.

68. A method according to claim 61, further comprising
the step of storing the image data, the sound data, the
text data produced in the sound processing step and
viewing data defining where each person is looking in a
database, the database being arranged to store the data
such that corresponding text data and viewing data are
associated with each other and with the corresponding
image data and sound data.

69. A method according to claim 68, wherein the image
data and the sound data are stored in the database in
compressed form. 



70. A method according to claim 69, wherein the image
data and the sound data are stored as MPEG data. 

71. A method according to claim 68, further comprising
the steps of generating data defining, for a
predetermined period, the proportion of time spent by a
given person looking at each of the other people during
the predetermined period, and storing the data in the
database so that it is associated with the corresponding
image data, sound data, text data and viewing data.

72. A method according to claim 71, wherein the
predetermined period comprises a period during which the 
given person was talking. 

73. A method according to claim 68, further comprising
the step of generating a signal conveying the database
with data therein. 

74. A method according to claim 73, further comprising
the step of recording the signal either directly or
indirectly to generate a recording thereof.

75. A method of processing image data, comprising:

receiving image data recorded by at least one camera
showing the movements of a plurality of people; and

processing the image data to determine where each
person is looking and to determine which of the people is
speaking based on where the people are looking.

76. A method according to claim 75, wherein it is
determined where each person is looking by processing the
image data using camera calibration data defining the
position and orientation of each camera from which image
data is processed.

77. A method according to claim 75, wherein it is
determined where each person is looking by processing the 
image data to track the position and orientation of each 
person's head in three dimensions. 

78. A method according to claim 75, wherein it is 
determined which person is speaking based on the number 
of people looking at each person. 

79. A method according to claim 78, wherein a value is
generated for each person defining at whom the person is
looking and the values are processed to determine the
person who is speaking.



80. A method according to claim 75, wherein it is 
determined that the person who is speaking is the person 
at whom the most other people are looking.

81. A storage device storing instructions for causing a
programmable processing apparatus to become configured as
an apparatus as set out in at least one of claims 43 and 
55. 

82. A storage device storing instructions for causing a
programmable processing apparatus to become operable to
perform a method as set out in at least one of claims 61
and 75. 

83. A signal conveying instructions for causing a
programmable processing apparatus to become configured as
an apparatus as set out in at least one of claims 43 and
55. 

84. A signal conveying instructions for causing a
programmable processing apparatus to become operable to
perform a method as set out in at least one of claims 61
and 75. 

85. Apparatus for processing image data and sound data,
comprising:

image processing means for processing image data
recorded by at least one camera showing the movements of
a plurality of people to determine where each person is
looking and to determine which of the people is speaking
based on where the people are looking; and

sound processing means for processing sound data
defining words spoken by the people to generate text data
therefrom in dependence upon the result of the processing
performed by the image processing means.



86. Apparatus for processing image data, comprising:

receiving means for receiving image data recorded by
at least one camera showing the movements of a plurality
of people; and

means for processing the image data to determine
where each person is looking and to determine which of
the people is speaking based on where the people are
looking.



